Automate corpora testing in CI #4927
Conversation
…al runs to get PR issue number
The bench uses --no-verification, so the engine's overlap-path dedup (which exists to protect verifiers from duplicate calls) adds noise without value here — it causes shifts in unrelated detectors when only one detector's regex changes. Pair --allow-verification-overlap with --no-verification so each detector's regex behavior is measured independently. Also fix the false 'no diff vs main' claim that triggered when NEW/REMOVED were zero but total counts differed.
awk's END block doesn't run when trufflehog exits before draining stdin (SIGPIPE kills awk first), leaving the bytes file empty and breaking the step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a default of 0 and validate it's an integer before arithmetic. Also fold unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay out of CI logs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
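A minimal sketch of that defensive read, assuming the per-file byte count lands in a file like bytes.txt (names here are illustrative, not the workflow's actual variables):

```bash
# Read the byte count with a default of 0: if awk died on SIGPIPE before
# its END block ran, the file is empty or missing, and naive interpolation
# would produce the `$((TOTAL_BYTES + ))` syntax error described above.
TOTAL_BYTES="${TOTAL_BYTES:-0}"
FILE_BYTES="$(cat bytes.txt 2>/dev/null || true)"
FILE_BYTES="${FILE_BYTES:-0}"

# Validate it's an integer before using it in arithmetic.
if ! [[ "$FILE_BYTES" =~ ^[0-9]+$ ]]; then
  FILE_BYTES=0
fi
TOTAL_BYTES=$((TOTAL_BYTES + FILE_BYTES))
```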
…pection
Static AST parse of a detector package to extract the strings returned by its Keywords() method. Used by the upcoming keyword-corpus builder to fan out per-detector GitHub Code Search queries during the corpora bench. AST-first because each detector lives in its own package; importing them dynamically would require codegen or `plugin`. Falls back to a regex over the function body, then a directory-wide grep, when AST resolution can't statically resolve the return value (helper calls, build-tagged variants). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Corpora Test Results
No detector regex or keyword changes in this PR. Bench skipped.
…urity/trufflehog into hackathon/detector-tests-in-ci
SUM(CASE WHEN Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) verified,
SUM(CASE WHEN NOT Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) unverified,
SUM(CASE WHEN VerificationError IS NOT NULL THEN 1 ELSE 0 END) \"unknown\"
Running with --no-verification above makes these values not meaningful, right?
Agreed! Thanks for catching this. Will remove
for CORPORA_FILE in "$@"; do
  if [[ "$CORPORA_FILE" == s3://* ]]; then
    aws s3 cp "$CORPORA_FILE" - | scan /dev/stdin
Note: though stdin is likely fine for this kind of testing, the newer json-enumerator input source in TruffleHog might make this a bit more straightforward. That input source expects NDJSON where each value looks like this:
{"data": "utf8 string to scan", "metadata": <arbitrary JSON value>}
or this:
{"data_b64": "base64-encoded bytestring to scan", "metadata": <arbitrary JSON value>}
Hmm, json-enumerator would give better chunk isolation and carry path metadata through. But in this context we're running with --no-verification and the metadata isn't used downstream, so the practical difference is minimal. Keeping stdin for now to avoid any unexpected behavioral changes, but happy to revisit if there's a specific benefit you had in mind.
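For reference, a sketch of what the reshaping could look like if we ever switch, reusing the corpus's .content field; the metadata payload is purely illustrative:

```bash
# Emit the NDJSON shape json-enumerator expects
# ({"data": ..., "metadata": <arbitrary JSON>}), one object per line.
unzstd -c "$CORPORA_FILE" \
  | jq -c --arg src "$CORPORA_FILE" '{data: .content, metadata: {source: $src}}'
```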
local rc=0
if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
  # Single S3 download teed to both binaries simultaneously.
  unzstd -c "$input" 2>> "$STDERR_FILE" \
    | jq -r .content 2>> "$STDERR_FILE" \
    | tee >(
        "${TRUFFLEHOG_BIN_MAIN}" \
          --no-update \
          --no-verification \
          --allow-verification-overlap \
          --log-level=3 \
          --concurrency=8 \
          --json \
          "${main_include_flag[@]}" \
          stdin >> "${OUTPUT_JSONL_MAIN}" 2>> "$STDERR_FILE"
      ) \
    | "$TRUFFLEHOG_BIN" \
        --no-update \
        --no-verification \
        --allow-verification-overlap \
        --log-level=3 \
        --concurrency=8 \
        --json \
        --print-avg-detector-time \
        "${INCLUDE_FLAG[@]}" \
        stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
  rc=$?
  wait
else
  unzstd -c "$input" 2>> "$STDERR_FILE" \
    | jq -r .content 2>> "$STDERR_FILE" \
    | "$TRUFFLEHOG_BIN" \
        --no-update \
        --no-verification \
        --allow-verification-overlap \
        --log-level=3 \
        --concurrency=8 \
        --json \
        --print-avg-detector-time \
        "${INCLUDE_FLAG[@]}" \
        stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
  rc=$?
fi
When I last used trufflehog stdin like this, I found that it would timeout after 1 minute, leaving you with an incomplete scan when dealing with large inputs. The workaround was, confusingly, to specify --archive-timeout=6h (or some similarly large value).
You might want to check that the scan is not being terminated early! This is another reason why you might prefer the json-enumerator input source, which doesn't have this timeout gotcha.
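For the record, the workaround described above looks like this (a sketch; the input redirection is illustrative):

```bash
# Raise the archive timeout so a long-running stdin scan isn't
# terminated after the default one minute.
trufflehog --no-update --no-verification --archive-timeout=6h --json stdin < corpus.txt
```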
I don't think this is an issue here. The scans consistently run for 30+ minutes and complete without truncation. I think it's because we decompress the corpus ourselves with unzstd before piping to stdin, so TruffleHog never sees a compressed archive and the archive timeout never applies.
s3://trufflehog-corpora-datasets/contents.2025-11-04.jsonl.zstd
s3://trufflehog-corpora-datasets/contents.jsonl.zstd
jobs:
Note, some of the actions used here are old versions. Also, you might consider pinning the action versions used here to reduce risk of possible supply-chain attacks.
zizmor is helpful: https://docs.zizmor.sh/
Thanks for this. Really helpful! I'll do the needful
pull_request:
  paths:
    - 'pkg/detectors/**'
    - 'pkg/engine/defaults/defaults.go'
    - '.github/workflows/detector-corpora-test.yml'
    - 'scripts/test/detector_corpora_test.sh'
    - 'scripts/test/diff_corpora_results.py'
    - 'scripts/test/detect_changed_detectors.sh'
Since the CPU work required by a single run of this workflow is pretty expensive (30+ minutes), is this something we want running automatically on pull requests (as it looks like it does here), or only as an opt-in workflow?
Good question. The workflow is scoped to only trigger when a PR modifies regex patterns or Keywords() in a detector. Purely structural changes (verification logic, redaction, comments, etc.) are filtered out and skip the bench entirely. In practice this means it only runs on PRs that actually affect match behavior.
We also don't currently merge a detector without running this test manually, so automating it in CI seems like the right call: it ensures the check never gets skipped and gives reviewers the data they need without having to ask for it.
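The filter is roughly this shape (a sketch, not the actual detect_changed_detectors.sh; the grep patterns are illustrative):

```bash
# List detector packages whose diff against the merge base touches a
# regex definition or the Keywords() method; everything else is skipped.
base="$(git merge-base origin/main HEAD)"
for dir in $(git diff --name-only "$base" -- pkg/detectors/ \
               | xargs -r -n1 dirname | sort -u); do
  if git diff -U0 "$base" -- "$dir" \
       | grep -E '^[+-]' \
       | grep -Eq 'regexp\.MustCompile|func \(.*\) Keywords\('; then
    echo "$dir"
  fi
done
```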
…e everything works as expected
Cursor Bugbot has reviewed your changes and found 1 potential issue.
| "${INCLUDE_FLAG[@]}" \ | ||
| stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE" | ||
| rc=$? | ||
| wait |
Main binary failure in process substitution silently ignored
Low Severity
In dual-binary mode, the main binary runs inside a tee >(...) process substitution whose exit code is never checked. If it crashes or exits early, wait runs under set +e so the failure is silently swallowed. Worse, if tee receives SIGPIPE from the broken process substitution pipe, it may also die, truncating input to the PR binary — producing silently incomplete results on both sides with no error surfaced.
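One way to surface that status, as a sketch (produce_corpus, main_scan, and pr_scan stand in for the real pipeline stages):

```bash
# Persist the main binary's exit status from inside the process
# substitution; bash exposes it through neither $? nor plain wait.
MAIN_RC_FILE="$(mktemp)"
produce_corpus \
  | tee >(
      main_scan >> main.jsonl 2>> stderr.log
      echo "$?" > "$MAIN_RC_FILE"
    ) \
  | pr_scan >> pr.jsonl 2>> stderr.log
rc=$?
wait  # give the process substitution time to write its status file

main_rc="$(cat "$MAIN_RC_FILE" 2>/dev/null)"
if [[ "${main_rc:-1}" -ne 0 ]]; then
  echo "main-binary scan failed (rc=${main_rc:-unknown})" >&2
  rc=1
fi
```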


Motivation
When adding or modifying a detector, the key question is: how much noise will this regex produce against real-world code? Too many false positives means alert fatigue; a regex that's too tight misses real secrets.
Previously this was a fully manual process — download a large corpus locally, run the pipeline, inspect the DuckDB output. There was no enforcement in CI, so it was easy to skip or forget, especially under time pressure.
This PR automates it. Every PR that touches detector code now gets a comment showing exactly how many unique matches the changed detector produces, compared to the main baseline. The comment updates in place on every push, so the PR timeline stays clean.
What it tells you
The bench scans a corpus of real-world public code (S3 datasets, ~35 GB in total compressed) using only the detectors changed in the PR, with verification disabled. It reports unique match counts for the PR build vs. the main baseline:
PRs that add a new detector will see a 🆕 row with an absolute match count and no baseline comparison.
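Concretely, each side of the bench boils down to an invocation of this shape (detector name illustrative; flags match the script excerpt above):

```bash
# Scoped, verification-free scan: only the changed detector runs, and
# overlap dedup is disabled so its matches are counted independently.
unzstd -c corpus.jsonl.zstd \
  | jq -r .content \
  | trufflehog --no-update --no-verification --allow-verification-overlap \
      --json --include-detectors="somedetector" stdin > results.jsonl
```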
Example output
What changed
- .github/workflows/detector-corpora-test.yml — new workflow; triggers on PRs touching pkg/detectors/**; detects which detectors changed, builds PR and main binaries in parallel, runs both scans against the corpus concurrently, posts a sticky comment with the diff
- scripts/test/detect_changed_detectors.sh — resolves changed detector directories to their proto enum names for --include-detectors scoping; skips detectors whose diff doesn't touch regex patterns or Keywords(), so PRs that only change verification or redaction logic don't trigger a bench run
- scripts/test/detector_corpora_test.sh — streams corpus files (S3 or local), runs trufflehog with --no-verification, outputs JSONL
- scripts/test/diff_corpora_results.py — diffs two JSONL result sets and renders the Markdown report posted to the PR

Performance
The naive implementation was slow. Two full scans (PR + main) each streamed the entire corpus independently from S3 — on a 35 GB dataset that meant two downloads, two decompressions, two jq passes, and two trufflehog runs, serialized per dataset file. First runs took ~50 minutes.
Three optimizations brought first runs down to ~38 minutes and subsequent pushes to the same PR down to ~30 minutes:
1. Main scan caching — The main binary scans the same commit (the merge base) on every push to a PR, as long as no rebase happens. We cache /tmp/results-main.jsonl in GitHub Actions keyed by merge-base SHA + scoped detector set. On subsequent pushes without a rebase, the entire main side (worktree checkout, go build, S3 download, scan) is skipped entirely.
2. Single S3 stream with tee — When the main scan does need to run (cache miss), both PR and main binaries now consume the same S3 stream via tee >(main_binary stdin). S3 is downloaded and decompressed once; both scans process the content in parallel at the OS pipe level.
3. Scoped scanning — Only the detectors changed in the PR are passed via --include-detectors. Scanning the full detector set against a large corpus is most of the work; scoping to 1–3 changed detectors cuts runtime proportionally.

Fork PRs
Excluded for fork PRs — S3 credentials are not available to fork-originated workflows. Maintainers need to run this manually for forked PRs.
Running locally
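Assuming a locally built binary and AWS credentials, a run looks roughly like this (the environment variable and arguments mirror the script; exact knobs may differ):

```bash
# Build the scanner, then point the bench script at one or more corpora
# (local paths or s3:// URLs, per the argument loop in the script).
go build -o trufflehog .
TRUFFLEHOG_BIN=./trufflehog \
  scripts/test/detector_corpora_test.sh \
  s3://trufflehog-corpora-datasets/contents.jsonl.zstd
```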
Results land in /tmp/corpora_results.jsonl with a DuckDB summary table printed to stdout.

Note
Medium Risk
Adds a new CI workflow that pulls large corpora from S3 using repository secrets and posts sticky PR comments, so failures/mis-scoping or credential issues can impact PR checks and leak operational signals (though no product runtime paths change).
Overview
Adds an automated detector corpora regression benchmark in CI that runs on PRs touching detector code and posts a sticky <!-- detector-bench --> PR comment with unique-match deltas (PR vs merge-base main) for the impacted detectors.

Introduces scripts to (1) detect which detectors actually changed matching behavior (regex/Keywords()), (2) stream one or more zstd JSONL corpora files (local or s3://) through trufflehog with --include-detectors and --no-verification, and (3) diff the two JSONL outputs into a Markdown report (including handling for new detectors with no baseline).

Optimizes CI runtime by caching the baseline scan keyed by merge-base SHA + detector set, building main and PR binaries in parallel, and teeing a single corpus stream to both binaries when a baseline scan is needed; also updates .gitignore to ignore Python bytecode artifacts.